Explorations of the Orcas distribution

Orcas
Location
Map
Author: Caoyu Shao

Published: August 18, 2025

Figure 1: Orcas distribution exploration

Temporal Patterns

The dataset spans 2017-01-05 to 2024-10-06, totaling 2,832 days (≈ 7.75 years).
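The span quoted above can be checked with a few lines of base R. This is a minimal sketch that assumes the 2,832-day figure counts both the first and the last day; the variable names are illustrative and do not come from the report's code.

```r
# Observation window as stated in the text.
first_day <- as.Date("2017-01-05")
last_day  <- as.Date("2024-10-06")

# Difference between the two dates, plus one so both endpoints are counted.
span_days <- as.numeric(last_day - first_day) + 1

# Convert to years using the average year length (leap years included).
span_years <- span_days / 365.25

span_days             # 2832
round(span_years, 2)  # 7.75
```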

Figure 1. Monthly time-series of orca encounters (x-axis: YYYY-MM; y-axis: encounters per month), showing month-to-month variation in observations.

Figure 2. Boxplots of monthly encounter counts across years (x-axis: month abbreviation; y-axis: encounters). Each box summarizes the distribution of counts for that month over all years in the dataset.

Result. Both figures indicate a pronounced seasonal peak in September. Across years, the median number of encounters in September is ≈18, whereas the median for other months is generally <10, indicating substantially higher activity in September.
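One way to obtain the per-month medians behind such a comparison is sketched below. The data frame and its `date` column are hypothetical toy values, not the report's data, and the two-step approach (count per year and month, then take the median per month) is an assumption about how the boxplots were summarized.

```r
library(dplyr)

# Hypothetical toy data: one row per encounter, with its date.
encounters <- data.frame(
  date = as.Date(c(
    "2021-08-03", "2021-09-01", "2021-09-14", "2021-09-20",
    "2022-08-11", "2022-08-25", "2022-09-02", "2022-09-09",
    "2022-09-17", "2022-09-30"
  ))
)

# Step 1: number of encounters in each year-month combination.
monthly_counts <- encounters |>
  mutate(year = format(date, "%Y"), month = format(date, "%m")) |>
  count(year, month, name = "n_encounters")

# Step 2: median of those counts for each calendar month across years.
monthly_medians <- monthly_counts |>
  group_by(month) |>
  summarise(median_encounters = median(n_encounters), .groups = "drop")

monthly_medians
```

Note that `count()` only produces rows for year-month combinations that occur in the data, so months with zero encounters are not represented unless they are added explicitly.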

Figure 3. Line chart of the share of annual encounters occurring in September, by year (x-axis: year; y-axis: September share of the yearly total).

Interpretation. The September share varies across 2017–2024 but never falls below ~15%, which is well above September’s calendar share of the year (one month in twelve, ~8.3%). Taken together with Figures 1–2, this indicates a clear seasonal pattern with a consistent peak in September.
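The comparison against the calendar baseline can be sketched as follows. The dates are hypothetical toy values and the column names are assumptions; the baseline of 1/12 matches the ~8.3% quoted in the text (a day-based baseline, 30/365, would be about 8.2%).

```r
library(dplyr)

# Hypothetical toy data: one row per encounter.
encounters <- data.frame(
  date = as.Date(c(
    "2021-03-10", "2021-07-04", "2021-09-01", "2021-09-14",
    "2022-05-21", "2022-09-02", "2022-09-09", "2022-09-17",
    "2022-11-30", "2022-12-05"
  ))
)

# Share of each year's encounters that fall in September.
september_share <- encounters |>
  mutate(
    year = format(date, "%Y"),
    is_september = format(date, "%m") == "09"
  ) |>
  group_by(year) |>
  summarise(share = mean(is_september), .groups = "drop")

# Share expected if encounters were spread evenly over the twelve months.
baseline <- 1 / 12

september_share |>
  mutate(above_baseline = share > baseline)
```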

Figure 4. Histogram of encounter durations (x-axis: duration in 1,000-second bins; y-axis: number of encounters).

Interpretation. The distribution is right-skewed with a long tail of very long encounters. Most encounters fall between 1,000 and 8,000 seconds, with the modal bin at 3,000–4,000 seconds (86 encounters).
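The modal bin can be identified without drawing the histogram. The sketch below uses hypothetical durations and assumes bins that are closed on the left, so a duration of exactly 4,000 seconds would fall in the 4,000–5,000 bin; the report's plotting code may place bin edges differently.

```r
# Hypothetical encounter durations in seconds.
duration_s <- c(500, 1200, 3100, 3400, 3900, 4500, 7800, 15000)

# Lower edge of the 1,000-second bin each duration falls into.
bin_start <- floor(duration_s / 1000) * 1000

# Number of encounters per bin, and the bin with the highest count.
bin_counts <- table(bin_start)
modal_bin  <- as.numeric(names(bin_counts)[which.max(bin_counts)])

modal_bin                # 3000, i.e. the 3,000-4,000 second bin
bin_counts[["3000"]]     # 3 encounters in that bin
```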

Spatial Patterns

Figure 5. Map of orca encounter locations.

Interpretation. Encounters are not uniformly distributed; they form distinct spatial clusters separated by noticeable gaps. The map also shows one outlying point north of 50.0° N, which may be either a genuine distant sighting or a data-entry or positioning error.
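A simple latitude filter is enough to pull such a point out for manual inspection. The coordinates and column names below are hypothetical; the 50° N threshold is the one mentioned in the text.

```r
# Hypothetical encounter locations (decimal degrees; west longitudes negative).
locations <- data.frame(
  encounter_id = 1:5,
  latitude  = c(48.52, 48.47, 48.61, 50.35, 48.55),
  longitude = c(-123.15, -123.05, -123.20, -125.10, -123.12)
)

# Records north of 50 degrees N, to be checked against the original entry.
far_north <- subset(locations, latitude > 50)

far_north
```

Whether a flagged record is an error or a real sighting cannot be decided from the coordinates alone; the filter only narrows down what to review.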

Figure 6. Hexbin heat map of encounter hotspots (hex size ≈ 5 km). Darker hexagons indicate higher encounter counts.

Interpretation. Consistent with Figure 5, encounters are concentrated within 48.0°–49.0° N and 123.5°–123.0° W, indicating a clear spatial core rather than a uniform distribution.
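To put a number on how concentrated the encounters are, one could compute the share of records inside the bounding box named above. This is a sketch on hypothetical coordinates, assuming decimal degrees with west longitudes stored as negative values.

```r
# Hypothetical encounter locations (decimal degrees; west longitudes negative).
locations <- data.frame(
  latitude  = c(48.52, 48.47, 48.61, 50.35, 48.55),
  longitude = c(-123.15, -123.05, -123.20, -125.10, -123.12)
)

# TRUE for points inside 48.0-49.0 N and 123.5-123.0 W.
in_core <- with(
  locations,
  latitude >= 48.0 & latitude <= 49.0 &
    longitude >= -123.5 & longitude <= -123.0
)

# Proportion of encounters that fall inside the core area.
core_share <- mean(in_core)
core_share
```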

Interpretation. The reproducible map clarifies the coastline geometry behind Figure 5: the arc-shaped pattern of encounters follows the bay shoreline, with the majority of sightings clustered within the bay rather than offshore. This indicates that the observed spatial pattern is largely coastline-constrained.

Summary of Encounters

Figure 7. Top 20 observers by number of encounters. Dave Ellifrit records the most encounters. Mark Malleson ranks second with ~300 encounters, roughly twice as many as the third-ranked observer. Together, they stand out as the most active observers in the dataset.

Figure 8. Top 10 vessels by number of encounters. Orcinus records the most encounters overall, while Mike1 ranks second with ~300 encounters. Both Orcinus and Mike1 clearly stand out as the most active vessels in the dataset.
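Rankings like those in Figures 7 and 8 can be produced by counting rows per category and keeping the largest counts. The sketch below uses hypothetical vessel names and a hypothetical `vessel` column; the same pattern would apply to an observer column.

```r
library(dplyr)

# Hypothetical toy data: one row per encounter, with the vessel involved.
encounters <- data.frame(
  vessel = c("Vessel A", "Vessel B", "Vessel A", "Vessel C", "Vessel A", "Vessel B")
)

# Count encounters per vessel and keep the two most active ones.
top_vessels <- encounters |>
  count(vessel, name = "n_encounters") |>
  slice_max(n_encounters, n = 2, with_ties = FALSE)

top_vessels
```

With `with_ties = FALSE` the result has exactly `n` rows; leaving the default would keep every vessel tied at the cutoff, which may be preferable when the ranking itself is of interest.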

Text Exploration of Encounters

Figure 9 shows that the most frequent word in the encounter summaries is “south” (1,645 occurrences), followed closely by “island” (1,633 occurrences).
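Word frequencies of this kind can be computed with tidytext, which the Resources section lists. The sketch below runs on hypothetical summaries with hypothetical column names, and the stop-word removal step is an assumption about the report's preprocessing rather than something the text states.

```r
library(dplyr)
library(tidytext)

# Hypothetical encounter summaries.
summaries <- data.frame(
  encounter_id = 1:3,
  summary = c(
    "The whales travelled south along the island shoreline.",
    "They turned south near the island.",
    "The group headed north past the island."
  ),
  stringsAsFactors = FALSE
)

word_counts <- summaries |>
  unnest_tokens(word, summary) |>          # one lowercase word per row
  anti_join(stop_words, by = "word") |>    # drop common words such as "the"
  count(word, sort = TRUE)                 # most frequent words first

head(word_counts)
```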

# A tibble: 20 × 2
   bigram               n
   <chr>            <int>
 1 san juan           389
 2 snug harbor        339
 3 haro strait        275
 4 race rocks         240
 5 juan island        205
 6 hundred yards      195
 7 heading north      179
 8 de fuca            168
 9 juan de            168
10 victoria harbour   167
11 half mile          160
12 heading south      148
13 morning star       139
14 island shoreline   136
15 false bay          135
16 kellett bluff      127
17 constance bank     117
18 quarter mile       112
19 boundary pass      109
20 lime kiln          106

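From the table, the most frequent bigram in the encounter summaries is “san juan” (389 occurrences), followed by “snug harbor” (339) and “haro strait” (275). These terms refer to place names in the study area: “San Juan” denotes the San Juan Islands (an archipelago), Snug Harbor is a locality on San Juan Island, and Haro Strait is the strait adjacent to San Juan Island. Several other entries are fragments of longer names: “juan island” comes from “San Juan Island”, and “juan de” and “de fuca” (168 each) most likely both come from “Juan de Fuca”.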
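A bigram table of this shape can be built with `unnest_tokens()` from tidytext using `token = "ngrams"`. The sketch below runs on hypothetical sentences and omits any stop-word filtering, so it is an illustration of the technique rather than a reconstruction of the report's code.

```r
library(dplyr)
library(tidytext)

# Hypothetical encounter summaries.
summaries <- data.frame(
  encounter_id = 1:2,
  summary = c(
    "Whales passed San Juan Island.",
    "Heading north in Haro Strait toward San Juan Island."
  ),
  stringsAsFactors = FALSE
)

bigram_counts <- summaries |>
  unnest_tokens(bigram, summary, token = "ngrams", n = 2) |>  # overlapping word pairs
  count(bigram, sort = TRUE)

bigram_counts
```

Because bigrams overlap, a three-word name such as "San Juan Island" contributes both "san juan" and "juan island", which is why fragments of longer place names appear as separate rows in tables like this one.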

Resources

The materials used for this report are:

Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686.

Wickham H (2023). conflicted: An Alternative Conflict Resolution Strategy. R package version 1.2.0, https://CRAN.R-project.org/package=conflicted.

Ryan J (2025). orcas: Scrape and Visualize Orca Sighting Data. R package version 0.0.0.9000, commit 08b3808ee4f5c9f1a25cbedea9e9d8316322ed1c, https://github.com/jadeynryan/orcas.

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://CRAN.R-project.org/package=dplyr.

Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL https://www.jstatsoft.org/v40/i03/.

Wickham H, Pedersen T, Seidel D (2023). scales: Scale Functions for Visualization. R package version 1.3.0, https://CRAN.R-project.org/package=scales.

Wickham H (2023). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1, https://CRAN.R-project.org/package=stringr.

Pebesma, E., & Bivand, R. (2023). Spatial Data Science: With Applications in R. Chapman and Hall/CRC. https://doi.org/10.1201/9780429459016

Pebesma, E., 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10 (1), 439-446, https://doi.org/10.32614/RJ-2018-009

Hahsler M, Piekenbrock M (2025). dbscan: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms. R package version 1.2.2, https://CRAN.R-project.org/package=dbscan.

Dunnington D (2023). ggspatial: Spatial Data Framework for ggplot2. R package version 1.1.9, https://CRAN.R-project.org/package=ggspatial.

Simon Garnier, Noam Ross, Robert Rudis, Antônio P. Camargo, Marco Sciaini, and Cédric Scherer (2024). viridis(Lite) - Colorblind-Friendly Color Maps for R. viridis package version 0.6.5.

Pedersen T (2025). ggforce: Accelerating ‘ggplot2’. R package version 0.5.0, https://CRAN.R-project.org/package=ggforce.

Cheng J, Schloerke B, Karambelkar B, Xie Y (2024). leaflet: Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library. R package version 2.2.2, https://CRAN.R-project.org/package=leaflet.

Wickham H, Vaughan D, Girlich M (2024). tidyr: Tidy Messy Data. R package version 1.3.1, https://CRAN.R-project.org/package=tidyr.

Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS, 1(3). doi:10.21105/joss.00037.

The links to my use of ChatGPT for help on this assignment are: